@bonnefoa bonnefoa commented Nov 6, 2025

What does this PR do?

Add sent/write/flush/replay LSN delay metrics from pg_stat_replication.

Motivation

pg_stat_replication exposes the last sent, written, flushed and replayed WAL locations for each standby server.

The sent delay doesn't depend on a feedback message from the standby, since it tracks the WAL sent through the connection. It can be used to gauge how far behind a standby is and how fast it is catching up.
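As an illustration of what an LSN delay measures, here is a minimal Python sketch of the arithmetic that `pg_wal_lsn_diff` performs on the server. The helper names `lsn_to_int` and `lsn_delay` are hypothetical and are not part of this PR; the only assumption is the standard textual LSN format, two hexadecimal numbers separated by a slash, where the first is the high 32 bits of the WAL byte position.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a WAL byte position."""
    high, low = lsn.split("/")
    # The part before the slash holds the high 32 bits, the part after the low 32 bits.
    return (int(high, 16) << 32) + int(low, 16)


def lsn_delay(current_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL between the current position and a standby's position."""
    return lsn_to_int(current_lsn) - lsn_to_int(standby_lsn)


if __name__ == "__main__":
    # Hypothetical values: the standby was sent everything except the last 0x60 bytes.
    print(lsn_delay("0/3000060", "0/3000000"))
```

A delay of zero means the standby has reached the reference position for that stage (sent, write, flush or replay).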

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

queries.append(QUERY_PG_REPLICATION_SLOTS)
queries.append(QUERY_PG_REPLICATION_STATS_METRICS)

P1 Badge Guard replication stats query on Aurora without logical WAL

The new QUERY_PG_REPLICATION_STATS_METRICS query now calls pg_current_wal_lsn() for every row, but it is still appended unconditionally for all Postgres ≥10 environments. On Aurora instances where wal_level is not set to logical, calling pg_current_wal_lsn() raises an error (this is why the control checkpoint metrics are skipped under the same condition a few lines above). Without a similar guard here, the check will start failing on default Aurora setups. Consider skipping this query when self.is_aurora and self.wal_level != 'logical' or using a function that is allowed on Aurora.
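The guard suggested in this review comment could look roughly like the sketch below. It is not the code in the PR: the function `select_replication_queries` and the placeholder query dictionaries are hypothetical, and only the condition (`is_aurora` and `wal_level != 'logical'`) is taken from the comment.

```python
# Placeholder query definitions, for illustration only.
QUERY_PG_REPLICATION_SLOTS = {"name": "pg_replication_slots"}
QUERY_PG_REPLICATION_STATS_METRICS = {"name": "pg_stat_replication"}


def select_replication_queries(is_aurora: bool, wal_level: str) -> list:
    """Return the replication queries to run for a given instance."""
    queries = [QUERY_PG_REPLICATION_SLOTS]
    # Skip the stats query where pg_current_wal_lsn() is reported to raise an error.
    if is_aurora and wal_level != "logical":
        return queries
    queries.append(QUERY_PG_REPLICATION_STATS_METRICS)
    return queries
```

With this shape, a non-Aurora instance or an Aurora instance with `wal_level = 'logical'` gets both queries, and any other Aurora instance gets only the first.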

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.31%. Comparing base (2188654) to head (9e3074f).
⚠️ Report is 1 commits behind head on master.

@bonnefoa bonnefoa force-pushed the bonnefoa/pg-walsender-lsn-delay branch from d26b527 to 8dea5ff Compare November 6, 2025 09:28
github-actions bot commented Nov 6, 2025

⚠️ Major version bump
The changelog type changed or removed was used in this Pull Request, so the next release will bump major version. Please make sure this is a breaking change, or use the fixed or added type instead.

@bonnefoa bonnefoa force-pushed the bonnefoa/pg-walsender-lsn-delay branch 2 times, most recently from 45beba7 to c6437d1 Compare November 6, 2025 10:56
sethsamuel previously approved these changes Nov 12, 2025
@bonnefoa
Collaborator Author

💡 Codex Review

queries.append(QUERY_PG_REPLICATION_SLOTS)
queries.append(QUERY_PG_REPLICATION_STATS_METRICS)

P1 Badge Guard replication stats query on Aurora without logical WAL
The new QUERY_PG_REPLICATION_STATS_METRICS query now calls pg_current_wal_lsn() for every row, but it is still appended unconditionally for all Postgres ≥10 environments. On Aurora instances where wal_level is not set to logical, calling pg_current_wal_lsn() raises an error (this is why the control checkpoint metrics are skipped under the same condition a few lines above). Without a similar guard here, the check will start failing on default Aurora setups. Consider skipping this query when self.is_aurora and self.wal_level != 'logical' or using a function that is allowed on Aurora.

Aurora doesn't rely on replication slots for its replication, which leaves pg_stat_replication and pg_stat_replication_slots empty:

select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+--------------+-------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+------------
(0 rows)

select * from pg_stat_replication_slots;
 slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes | stats_reset
-----------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+-------------
(0 rows)

If a replication slot is created, it would be for logical replication, meaning wal_level will be high enough to run the query on Aurora.
Without any replication slot, the query runs and returns no results:

SELECT
    rep.application_name,
    rep.state,
    rep.sync_state,
    rep.client_addr,
    slot.slot_name,
    slot.slot_type,
    GREATEST (0, age(rep.backend_xmin)) as backend_xmin_age,
    pg_wal_lsn_diff(
    CASE WHEN pg_is_in_recovery() THEN pg_last_wal_receive_lsn() ELSE pg_current_wal_lsn() END, sent_lsn),
    pg_wal_lsn_diff(
    CASE WHEN pg_is_in_recovery() THEN pg_last_wal_receive_lsn() ELSE pg_current_wal_lsn() END, write_lsn),
    pg_wal_lsn_diff(
    CASE WHEN pg_is_in_recovery() THEN pg_last_wal_receive_lsn() ELSE pg_current_wal_lsn() END, flush_lsn),
    pg_wal_lsn_diff(
    CASE WHEN pg_is_in_recovery() THEN pg_last_wal_receive_lsn() ELSE pg_current_wal_lsn() END, replay_lsn),
    GREATEST (0, EXTRACT(epoch from rep.write_lag)) as write_lag,
    GREATEST (0, EXTRACT(epoch from rep.flush_lag)) as flush_lag,
    GREATEST (0, EXTRACT(epoch from rep.replay_lag)) AS replay_lag
FROM pg_stat_replication as rep
LEFT JOIN pg_replication_slots as slot
ON rep.pid = slot.active_pid;
 application_name | state | sync_state | client_addr | slot_name | slot_type | backend_xmin_age | pg_wal_lsn_diff | pg_wal_lsn_diff | pg_wal_lsn_diff | pg_wal_lsn_diff | write_lag | flush_lag | replay_lag
------------------+-------+------------+-------------+-----------+-----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------+-----------+------------
(0 rows)
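The four delay columns in the query above share the same reference position and differ only in the `pg_stat_replication` column they compare against. A minimal Python sketch of generating them from one shared expression follows. The constant and function names and the `_delay` column aliases are hypothetical and are not taken from the PR; only the SQL fragments come from the query shown above.

```python
# Reference WAL position: last received LSN on a standby, current LSN on a primary.
CURRENT_LSN = (
    "CASE WHEN pg_is_in_recovery() "
    "THEN pg_last_wal_receive_lsn() ELSE pg_current_wal_lsn() END"
)

# pg_stat_replication columns holding each standby's progress.
LSN_COLUMNS = ("sent_lsn", "write_lsn", "flush_lsn", "replay_lsn")


def lsn_delay_expressions() -> list:
    """Build one pg_wal_lsn_diff() select expression per LSN column."""
    return [
        f"pg_wal_lsn_diff({CURRENT_LSN}, {col}) AS {col}_delay"
        for col in LSN_COLUMNS
    ]


if __name__ == "__main__":
    print(",\n".join(lsn_delay_expressions()))
```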

pg_stat_replication provides metrics on the last sent/write/flush/replay
WAL location by a standby server.

sent delay doesn't depend on a feedback message from the standby since
it tracks the sent WAL through the connection. This can be used to
gauge how fast and how late a standby is when catching up.

Only run checkpoint and check metrics on the primary. This removes
the uncertainty about whether the checkpoint record has been
propagated to the standby before the standby's checkpoint is triggered.

Add the timestamp to the application name to reduce the risk of leftover
queries and connections being present when metrics are collected.
@bonnefoa bonnefoa force-pushed the bonnefoa/pg-walsender-lsn-delay branch from c6437d1 to 9e3074f Compare November 12, 2025 14:24
@temporal-github-worker-1 temporal-github-worker-1 bot dismissed sethsamuel’s stale review November 12, 2025 14:24

Review from sethsamuel is dismissed. Related teams and files:

  • database-monitoring-agent
    • postgres/changelog.d/21844.added
    • postgres/datadog_checks/postgres/util.py
    • postgres/metadata.csv
    • postgres/tests/test_pg_integration.py
    • postgres/tests/test_pg_replication.py
@sethsamuel sethsamuel added this pull request to the merge queue Nov 12, 2025
Merged via the queue into master with commit 97e7ff7 Nov 12, 2025
83 of 84 checks passed
@sethsamuel sethsamuel deleted the bonnefoa/pg-walsender-lsn-delay branch November 12, 2025 14:51